
;;; -*- Mode: TEXT -*-
;;; File: AutoClass:doc;interpretation.text
;;;------------------------------------------------------------------------;;;
;;; AUTOCLASS 3.0 Released 5/11/90 contact: Taylor@pluto.arc.nasa.gov      ;;;
;;; by P. Cheeseman, J. Stutz, R. Hanson, W. Taylor                        ;;;
;;; NASA Ames Research Center, MS 244-17, Moffett Field, CA 94035          ;;;
;;;                                                                        ;;;
;;; Copyright (C) 1990 Research Institute for Advanced Computer Science.   ;;;
;;; All rights reserved. The RIACS Software Policy contains specific       ;;;
;;; terms and conditions on the use of this software, and must be          ;;;
;;; distributed with any copies. THIS FILE MAY BE REDISTRIBUTED. This      ;;;
;;; copyright and notice must be preserved in all copies made of this file.;;;
;;;------------------------------------------------------------------------;;;

Interpretation of AutoClass Results:

Now you have run AutoClass on your data set – what have you got? Typically, the AutoClass search procedure finds many classifications, but only saves the few best. These are now available for inspection and interpretation.

The most important indicator of the relative merits of these alternative classifications is the Log total posterior probability value. Note that since the probability lies between 0 and 1, the Log probability is negative and ranges between negative infinity and 0. The number e raised to the power of the difference between two of these Log probability values gives the relative probability of the alternative classifications. So a difference of, say, 100 implies that one classification is e^100 times more likely than the other.

However, these numbers can be very misleading, since they give the relative probability of alternative classifications under the AutoClass ***assumptions***. Specifically, the most important AutoClass assumptions are the use of normal models for real variables, and the assumption of independence of attributes within a class.
Since these assumptions are often violated in practice, the difference in posterior probability of alternative classifications can be partly due to one classification being closer to satisfying the assumptions than another, rather than to a real difference in classification quality. Another source of uncertainty about the utility of Log probability values is that they do not take into account any specific prior knowledge the user may have about the domain. This means that it is often worth looking at alternative classifications to see if you can interpret them, but it is worth starting from the most probable first. Note that if the Log probability value is much greater than that for the one class case, it is saying that there is overwhelming evidence for ***some*** structure in the data, and part of this structure has been captured by the AutoClass classification.

So you have now picked a classification you want to examine, based on its Log probability value; how do you examine it? The first thing to do is to generate an "influence" report on the classification using the report generation facilities documented in /doc/reports.text. An influence report is designed to summarize the important information buried in the AutoClass data structures.

The first part of this report is a listing of the overall "influence" of each of the attributes used in the classification. These influence values are a weighted average of the "influence" of each attribute in each class, as described below.

The next part of the report is a summary description of each of the classes. The classes are arbitrarily numbered from 0 up to n, in order of descending class weights. A class weight of, say, 34.1 means that the weighted sum of membership probabilities for that class is 34.1. Note that a class weight of 34 does not mean that 34 cases belong to that class, since many cases may have only partial membership in that class.
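
Two pieces of arithmetic from the discussion above can be illustrated with a short Python sketch. It is not part of AutoClass: the function names and the list-of-lists membership table are hypothetical, and the class weight computation assumes every case counts equally (a unit weight per case), which may not match how AutoClass weights cases internally.

```python
import math


def relative_probability(log_post_a, log_post_b):
    """Return e**(log_post_a - log_post_b): how many times more probable
    classification A is than classification B, given their Log total
    posterior probability values (natural logs assumed)."""
    diff = log_post_a - log_post_b
    try:
        return math.exp(diff)
    except OverflowError:
        # Very large differences overflow a float; report infinity instead.
        return math.inf


def class_weights(memberships):
    """Sum membership probabilities per class.

    `memberships` is a hypothetical table with one row per case and one
    column per class; each row is assumed to sum to 1.
    """
    n_classes = len(memberships[0])
    return [sum(row[j] for row in memberships) for j in range(n_classes)]


if __name__ == "__main__":
    # A difference of 100 in Log probability gives a ratio of e^100.
    print(relative_probability(-1000.0, -1100.0))

    # Four cases, two classes, with partial memberships.
    table = [[0.9, 0.1], [0.6, 0.4], [0.5, 0.5], [0.1, 0.9]]
    print(class_weights(table))
```

In the small table, the first class ends up with a weight of about 2.1 even though no case belongs to it entirely, which is why a class weight should not be read as a count of cases.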
In the report on each class, the attribute parameters for that class are given in order of decreasing influence value. The "influence" value of an attribute in a class is an information measure that roughly indicates how informative that attribute is in describing a class – i.e., an indication of the distinguishing attributes for that class. Usually only the first few attributes have significant influence values. If an influence value drops below about 20